GH-101362: Optimise pathlib by deferring path normalisation #101560

barneygale · 2023-02-04T15:56:46Z

PurePath now normalises and splits paths only when necessary, e.g. when .name or .parent is accessed. The result is cached. This speeds up path object construction by around 4x.
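A rough sketch of the deferral idea, not the actual patch: `LazyPath`, `_raw`, `_parts` and `parse_count` are hypothetical names, it only handles relative POSIX-style paths, and it leans on `functools.cached_property` where pathlib itself uses `__slots__`.

```python
import posixpath
from functools import cached_property


class LazyPath:
    """Hypothetical stand-in for PurePath; relative POSIX paths only."""

    def __init__(self, *args):
        # Construction only stores the raw segments; nothing is parsed yet.
        self._raw = args
        self.parse_count = 0  # instrumentation for this demo only

    @cached_property
    def _parts(self):
        # Normalise and split on first access; cached_property keeps the
        # result, so later accesses skip this work entirely.
        self.parse_count += 1
        joined = posixpath.join(*self._raw) if self._raw else ""
        return tuple(p for p in joined.split("/") if p and p != ".")

    @property
    def name(self):
        # Accessing .name is what triggers normalisation.
        return self._parts[-1] if self._parts else ""


p = LazyPath("foo//bar", "./baz.txt")
print(p.parse_count)  # 0: construction did no parsing
print(p.name)         # baz.txt
print(p.name)         # baz.txt (served from the cache)
print(p.parse_count)  # 1
```

Construction is cheap because it only stores a tuple; the cost moves to the first access that needs the parsed parts.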

~~PurePath.__fspath__() now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure Path methods, and also speeds up p.joinpath('bar') and p / 'bar'.~~ edit: will fix separately.
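To illustrate the "transparent to filesystem APIs" claim (the `__fspath__()` change itself was later backed out of this PR), here is a small sketch. `RawPath` and `read_via_raw_path()` are hypothetical names; the point is only that the OS resolves a path with redundant separators and `.` segments to the same file.

```python
import os
import tempfile


class RawPath:
    """Hypothetical path-like object whose __fspath__() skips normalisation."""

    def __init__(self, raw):
        self._raw = raw

    def __fspath__(self):
        # Hand the string back exactly as given.
        return self._raw


def read_via_raw_path():
    """Open a file through a messy path and compare it with the clean one."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "file.txt")
        with open(target, "w") as f:
            f.write("hello")
        # Same file, spelled with a doubled separator and a '.' segment.
        messy = RawPath(f"{d}{os.sep}{os.sep}.{os.sep}file.txt")
        with open(messy) as f:
            content = f.read()
        return content, os.path.samefile(messy, target)


print(read_via_raw_path())  # ('hello', True)
```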

~~This also fixes GH-76846 and GH-85281 by unifying path constructors and adding an __init__() method.~~ edit: will fix separately.

Issue: Optimize pathlib path construction #101362

`PurePath` now normalises and splits paths only when necessary, e.g. when `.name` or `.parent` is accessed. The result is cached. This speeds up path object construction by around 4x. `PurePath.__fspath__()` now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure `Path` methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`. This also fixes pythonGH-76846 and pythonGH-85281 by unifying path constructors and adding an `__init__()` method.

barneygale · 2023-02-04T18:26:26Z

Constructing path objects is up to 4x faster with one argument:

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo/bar")' 
1000000 loops, best of 5: 2.01 usec per loop  # before
1000000 loops, best of 5: 495 nsec per loop  # after

More than 2x faster with two arguments:

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo", "bar")' 
1000000 loops, best of 5: 2.28 usec per loop  # before
1000000 loops, best of 5: 1.02 usec per loop  # after

~~And ~25% faster when joining arguments:~~

[edit: no longer true! ]

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath; p = PurePath("foo")' 'p.joinpath("bar")' 
1000000 loops, best of 5: 1.66 usec per loop  # before
1000000 loops, best of 5: 1.3 usec per loop  # after

But it's 12% slower when the path needs normalization, as with str():

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'str(PurePath("foo/bar"))' 
1000000 loops, best of 5: 2.96 usec per loop  # before
1000000 loops, best of 5: 3.31 usec per loop  # after

~~And 25% slower when walking directories (where pathlib keeps everything normalized):~~

[edit: resolved! see comment]

$ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))' 
20 loops, best of 5: 53.4 msec per loop  # before
20 loops, best of 5: 66.5 msec per loop  # after

~~But still faster for filesystem operations that don't require normalization:~~

[edit: no longer true! this can't be properly fixed until other stuff lands]

$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").read_text()' 
100000 loops, best of 5: 26.1 usec per loop  # before
100000 loops, best of 5: 21.2 usec per loop  # after

$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").exists()' 
100000 loops, best of 5: 5.45 usec per loop  # before
100000 loops, best of 5: 2.97 usec per loop  # after
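
For anyone who would rather reproduce the pure-path comparisons from a script than from the shell, a minimal sketch using the stdlib `timeit` module follows. The `bench()` helper and the labels are my own; the statements are the ones from the commands above. Absolute numbers will depend on the machine and the build.

```python
import timeit

SETUP = "from pathlib import PurePath"

# Statements copied from the shell commands above; the keys are just labels.
STATEMENTS = {
    "construct, one arg": 'PurePath("foo/bar")',
    "construct, two args": 'PurePath("foo", "bar")',
    "construct + str()": 'str(PurePath("foo/bar"))',
}


def bench(number=100_000, repeat=5):
    """Return the best per-loop time in microseconds for each statement."""
    results = {}
    for label, stmt in STATEMENTS.items():
        timings = timeit.repeat(stmt, setup=SETUP, number=number, repeat=repeat)
        # Best of `repeat` runs, divided by the loop count, in microseconds.
        results[label] = min(timings) / number * 1e6
    return results


if __name__ == "__main__":
    for label, usec in bench().items():
        print(f"{label}: {usec:.3f} usec per loop")
```

Running it once on each build (before and after the patch) gives the same kind of before/after pairs as the listings above.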

barneygale · 2023-02-07T20:47:26Z

I've found a couple other small optimizations which are best tackled in other PRs, so I'm marking this PR as a 'draft' for now.

barneygale · 2023-03-06T02:31:58Z

I've undone the change to _from_parsed_parts(), which has restored directory-walking performance:

$ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))' 
20 loops, best of 5: 146 msec per loop  # before
20 loops, best of 5: 152 msec per loop  # after

Still a tiny bit slower than pre-PR.

The rest of the speedups/slowdowns mentioned in my previous comment are still there.

barneygale · 2023-03-11T23:32:39Z

The change to importlib is necessary because it's relying on a bug in pathlib's path normalization:

pathlib strips trailing slash #65238

I think I need to solve that issue first, so I'm going to mark this PR as a draft (again!)
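
For reference, the trailing-slash behaviour from that issue can be seen with a couple of lines. This is only an illustration of the normalisation; the `pkg/resources` path is made up and it does not show the importlib code in question.

```python
from pathlib import PurePosixPath

# A trailing slash on the input does not survive normalisation.
with_slash = PurePosixPath("pkg/resources/")
without_slash = PurePosixPath("pkg/resources")

print(str(with_slash))              # pkg/resources
print(with_slash == without_slash)  # True
```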

barneygale · 2023-03-17T19:22:58Z

This PR has strayed too far from the original implementation. I'm going to abandon it. New PR here:

GH-76846, GH-85281: Call __new__() and __init__() on pathlib subclasses #102789

bedevere-bot added the awaiting review label Feb 4, 2023

bedevere-bot mentioned this pull request Feb 4, 2023

Optimize pathlib path construction #101362

Closed

barneygale added 4 commits February 4, 2023 16:32

Restore str force-casting behaviour; reduce diff a little.

e49e719

Fix pathlib usage error in importlib

a931986

Improve initialiser performance.

38f70bf

Add NEWS blurb

10844a6

barneygale marked this pull request as ready for review February 4, 2023 18:28

barneygale requested review from jaraco and warsaw as code owners February 4, 2023 18:28

barneygale requested a review from AlexWaygood February 4, 2023 18:30

Store '_fspath' as non-empty string.

d5231b6

barneygale marked this pull request as draft February 7, 2023 20:35

barneygale added the topic-pathlib label Feb 7, 2023

barneygale added 2 commits March 6, 2023 02:00

Merge branch 'main' into optimize-pathlib-part-2b

76b2db7

Undo addition of __init__() and change to _from_parsed_parts()

cc2c711

barneygale changed the title ~~GH-101362 - Optimise pathlib by deferring path normalisation~~ GH-101362: Optimise pathlib by deferring path normalisation Mar 6, 2023

AlexWaygood added the performance Performance or resource usage label Mar 6, 2023

Fix pickling of paths created via walking.

cbf0fcd

barneygale marked this pull request as ready for review March 6, 2023 02:33

Simplify patch slightly

eb7087f

barneygale mentioned this pull request Mar 10, 2023

GH-78079: Fix UNC device path root normalization in pathlib #102003

Merged

barneygale added 3 commits March 11, 2023 21:55

Remove unused import

3b53e27

Merge branch 'main' into optimize-pathlib-part-2b

dec98f2

Fix dodgy merge, add comment about _from_parsed_parts()

d9a6080

barneygale marked this pull request as draft March 11, 2023 23:32

This was referenced Mar 12, 2023

GH-101362: Omit path anchor from pathlib.PurePath()._parts #102476

Merged

pathlib strips trailing slash #65238

Closed

barneygale added 2 commits March 17, 2023 15:43

Stop returning unnormalised path from __fspath__()

9650dca

Merge branch 'main' into optimize-pathlib-part-2b

af27476

barneygale marked this pull request as ready for review March 17, 2023 16:20

barneygale mentioned this pull request Mar 17, 2023

Optimize pathlib.PurePath.__fspath__() #102783

Closed

barneygale marked this pull request as draft March 17, 2023 16:45

Undo change to joinpath(), fix news blurb.

97995bd

barneygale closed this Mar 17, 2023
